Abstract

We present data-oblivious algorithms in the external-memory model for compaction, selection, and 
sorting. Motivation for such problems comes from clients who use outsourced data storage services 
and wish to mask their data access patterns. We show that compaction and selection can be done
data-obliviously using $O(N/B)$ I/Os, and sorting can be done, with a high probability of success, using
$O((N/B)\log_{M/B}(N/B))$ I/Os. Our methods use a number of new algorithmic techniques, including
data-oblivious uses of invertible Bloom lookup tables, a butterfly-like compression network, randomized 
data thinning, and "shuffle-and-deal" data perturbation. In addition, since data-oblivious sorting is the 
bottleneck in the "inner loop" in existing oblivious RAM simulations, our sorting result improves the 
amortized time overhead to do oblivious RAM simulation by a logarithmic factor in the external-memory 
model. 



1 Introduction 

Online data storage is becoming a large and growing industry, in which companies provide large storage 
capabilities for hosting outsourced data for individual or corporate users, such as with Amazon S3 and 
Google Docs. As an example of the size and growth of this industry, we note that in March 2010 the Amazon 
S3 service passed the milestone of hosting over 100 billion objects which was double the number from 
the year before. 

The companies that provide such services typically give their users guarantees about the availability and 
integrity of their data, but they often have commercial interests in learning as much as possible about their 
users' data. For instance, it may be in such a company's commercial interest to search a user's documents 
for keywords that could trigger advertisements that are better targeted towards a user's interests. Of course, 
a significant portion of a user's data might be sensitive, either for personal or proprietary reasons; hence, 
there is a need for techniques for keeping a user's data private in such circumstances. 

Clearly, a necessary first step towards achieving privacy is for a user to store his or her data in encrypted 
form, e.g., using a secret key that is known only to the user. Simply encrypting one's data is not sufficient to 
achieve privacy in this context, however, since information about the content of a user's data is leaked by the 
pattern in which the user accesses it. For instance, Chen et al. [15] show that highly sensitive information 
can be inferred from user access patterns at popular health and financial web sites even if the contents of 
those communications are encrypted. 

www.datacenterknowledge.com/archives/2010/03/9/amazon-s3-now-hosts-100-billion-objects/
Problem Statement. The problem of protecting the privacy of a user's data accesses in an outsourced
data storage facility can be defined in terms of external-memory data-oblivious RAM computations. In this
framework, a user, Alice, has a CPU with a private cache, which can be used for data-dependent
computations. She stores the bulk of her data on a data storage server owned by Bob, who provides her with an
interface that supports indexed addressing of her data, so each word has a unique address. As in the standard
external-memory model (e.g., see [1]), where external storage is typically assumed to be on a disk drive, data
on the external storage device is accessed and organized in contiguous blocks, with each block holding B
words, where $B \ge 1$. That is, this is an adaptation of the standard external-memory model so that the CPU
and cache are associated with Alice and the external disk drive is associated with Bob. The service provider,
Bob, is trying to learn as much as possible about the content of Alice's data and he can view the sequence
and location of all of Alice's disk accesses. But he cannot see the content of what is read or written, for we
assume it is encrypted using a semantically secure encryption scheme such that re-encryption of the same
value is indistinguishable from an encryption of a different value. Moreover, Bob cannot view the content
or access patterns of Alice's private cache, the size of which, M, is allowed to be a function of the input size,
N, and/or the block size, B. Using terminology of the cryptography literature, Bob is assumed to be an
honest-but-curious adversary [21], in that he correctly performs all protocols and does not tamper with any
data, but he is nevertheless interested in learning as much as possible about Alice's data.

We say that Alice's sequence of I/Os is data-oblivious for solving some problem, $\mathcal{P}$ (like sorting or
selection), if the distribution of this sequence depends only on $\mathcal{P}$, $N$, $M$, and $B$, and the length of the
access sequence itself. In particular, this distribution should be independent of the data values in the input.
Put another way, this definition of a data-oblivious computation means that $\Pr(S \mid \mathcal{M}, \mathcal{P}, N, M, B)$, the
probability that Bob sees an access sequence, $S$, conditioned on a specific initial configuration of external
memory, $\mathcal{M}$, and the values of $\mathcal{P}$, $N$, $M$, and $B$, satisfies

$$\Pr(S \mid \mathcal{M}, \mathcal{P}, N, M, B) = \Pr(S \mid \mathcal{M}', \mathcal{P}, N, M, B),$$

for any memory configuration $\mathcal{M}' \neq \mathcal{M}$ such that $|\mathcal{M}'| = |\mathcal{M}|$. Note, in particular, that this implies that the
length, $|S|$, of Alice's access sequence cannot depend on specific input values. Examples of data-oblivious
access sequences for an array, $A$, of size $N$, in Bob's external memory, include the following (assuming
$B = 1$):

• Simulating a circuit, C, with its inputs taken in order from A. For instance, C could be a Boolean
circuit or an AKS sorting network [2, 3].

• Accessing the cells of A according to a random hash function, h(i), as A[h(1)], A[h(2)], . . ., A[h(n)],
or random permutation, π(i), as A[π(1)], A[π(2)], . . ., A[π(n)].

Examples of computations on A that would not be data-oblivious include the following: 

• Using the standard merge-sort or quick-sort algorithm to sort A. (Neither of these well-known algo- 
rithms is data-oblivious.) 

• Using values in A as indices for a hash table, T, and accessing them as T[h(A[1])], T[h(A[2])], . . .,
T[h(A[n])], where h is a random hash function. (For instance, think of what happens if all the values
in A are equal and how improbable the resulting n-way collision in T would be.)

This last example is actually a little subtle. It is, in general, not data-oblivious to access a hash table, T, 
in this way, but note that such an access sequence would be data-oblivious if the elements in A are always 
guaranteed to be distinct. In this case, each such access in T would be to a random location and every 
possible sequence of such accesses in T would be equally likely for any set of distinct values in A, assuming 
our random hash function, h, satisfies the random oracle model (e.g., see [6]).
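
The contrast between the two lists of examples can be checked on a toy scale. The sketch below is only an illustration, not one of the algorithms of this paper: it swaps in odd-even transposition sort (a simple fixed compare-exchange schedule) in place of the AKS network, and the names `odd_even_transposition_trace` and `hashed_lookup_trace` are hypothetical. It records the sequence of cells touched, which is what Bob observes when B = 1.

```python
import random

def odd_even_transposition_trace(a):
    """Sort a copy of `a` with a fixed compare-exchange schedule.

    Returns (sorted_list, trace), where trace lists the pairs of cells touched.
    """
    a = list(a)
    n = len(a)
    trace = []
    for rnd in range(n):                       # n rounds suffice for this network
        for i in range(rnd % 2, n - 1, 2):
            trace.append((i, i + 1))           # cells touched depend only on n
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a, trace

def hashed_lookup_trace(a, table_size, seed):
    """Access T[h(A[i])] for each i; the trace depends on the values stored in A."""
    rng = random.Random(seed)
    h = {}                                     # lazily sampled random function
    trace = []
    for v in a:
        if v not in h:
            h[v] = rng.randrange(table_size)
        trace.append(h[v])
    return trace

sorted_x, trace_x = odd_even_transposition_trace([3, 1, 2, 5, 4])
sorted_y, trace_y = odd_even_transposition_trace([1, 1, 1, 1, 1])
print(trace_x == trace_y)                      # same accesses for different data
print(len(set(hashed_lookup_trace([7] * 5, 100, seed=0))))  # one cell hit n times
```

The sorting trace is identical for both inputs, whereas the hashed lookups of an all-equal array hit a single cell repeatedly, which reveals the n-way collision described above.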
Related Prior Results. Data-oblivious sorting is a classic algorithmic problem, which Knuth llTTll studies
in some depth, since deterministic schemes give rise to sorting networks, such as the theoretically-optimal
$O(n \log n)$ AKS network [2, 3, 38, 42] as well as practical sorting networks [29, 39]. Randomized
data-oblivious sorting algorithms running in $O(n \log n)$ time and succeeding with high probability are likewise
studied by Leighton and Plaxton [31] and Goodrich [23]. Moreover, data-oblivious sorting is finding
applications to privacy-preserving secure multi-party computations [44]. For the related selection problem,
Leighton et al. [30] give a randomized data-oblivious solution that runs in $O(n \log\log n)$ time and they
give a matching lower bound for methods exclusively based on compare-exchange operations. In addition,
we note that the data-oblivious compaction problem is strongly related to the construction of
(n, m)-concentrator networks lITTl.

Chaudhry and Cormen [14] argue that data-oblivious external-memory sorting algorithms have several
advantages, including good load-balancing and the avoidance of poor performance on "bad" inputs. They
give a data-oblivious sorting algorithm for the parallel disk model and analyze its performance
experimentally [13]. But their method is size-limited and does not achieve the optimal I/O complexity of previous
external-memory sorting algorithms (e.g., see [1]), which use $O((N/B)\log_{M/B}(N/B))$ I/Os. None of the
previous external-memory sorting algorithms that use $O((N/B)\log_{M/B}(N/B))$ I/Os are data-oblivious,
however.

Goodrich and Mitzenmacher [24] show that any RAM algorithm, A, can be simulated in a data-oblivious
fashion in the external-memory model so that each memory access performed by A has an overhead of
$O(\min\{\log^2(N/B),\ \log^2_{M/B}(N/B)\log_2 N\})$ amortized I/Os in the simulation, and which is data-oblivious
with high probability (w.v.h.p.), which both improves and extends a result of Goldreich and Ostrovsky [22]
to the I/O model. The overhead of the Goodrich-Mitzenmacher simulation is optimal for the case when
N/B is polynomial in M/B and is based on an "inner loop" use of a deterministic data-oblivious sorting
algorithm that uses $O((N/B)\log^2_{M/B}(N/B))$ I/Os, but this result is not optimal in general.

Our Results. We give data-oblivious external-memory algorithms for compaction, selection, quantile 
computation, and sorting of outsourced data. Our compaction and selection algorithms use $O(N/B)$ I/Os
and our sorting algorithm uses $O((N/B)\log_{M/B}(N/B))$ I/Os, which is asymptotically optimal [1]. All our
algorithms are randomized and succeed with high probability. Our results also imply that one can improve 
the expected amortized overhead for a randomized external-memory data-oblivious RAM simulation to be 

Our main result is for the sorting problem, as we are not aware of any asymptotically-optimal oblivious 
external-memory sorting algorithm prior to our work. Still, our other results are also important, in that our 
sorting algorithm is based on our algorithms for selection and quantiles, which in turn are based on new 
methods for data-oblivious data compaction. We show, for instance, that efficient compaction leads to a 
selection algorithm that uses $O(n)$ I/Os and succeeds w.v.h.p., where $n = O(N/B)$. Interestingly, this result
beats a lower bound for the complexity of parallel selection networks, of $\Omega(n \log\log n)$, due to Leighton
et al. [30]. Our selection result doesn't invalidate their lower bound, however, for their lower bound is for
circuits that only use compare-exchange operations, whereas our algorithm uses other "blackbox" primitive 
operations besides compare-exchange, including addition, subtraction, data copying, and random functions. 
Thus, our result demonstrates the power of using operations other than compare-exchange in a data-oblivious 
selection algorithm. 

Our algorithms are based on a number of new techniques for data-oblivious algorithm design, including 
the use of a recent data structure by Goodrich and Mitzenmacher [25], known as the invertible Bloom
lookup table. We show in this paper how this data structure can be used in a data-oblivious way to solve
the compaction problem, which in turn leads to efficient methods for selection, quantiles, and sorting. Our
methods also use a butterfly-like routing network and a "shuffle-and-deal" technique reminiscent of
Valiant-Brebner routing [43], so as to avoid data-revealing access patterns.

In this paper, we take the phrase "with high probability" to mean that the probability is at least
$1 - 1/(N/B)^d$, for a given constant d > 1.

Given the motivation of our algorithms from outsourced data privacy, we make some reasonable assump- 
tions regarding the computational models used by some of our algorithms. For instance, for some of our 
results, we assume that $B \ge \log^{\epsilon}(N/B)$, which we call the wide-block assumption. This is, in fact,
equivalent to or weaker than several similar assumptions in the external-memory literature (e.g., see ir7l [T0l[33ll37ll40l
mmSl). In addition, we sometimes also use a weak tall-cache assumption that $M \ge B^{1+\epsilon}$, for some small
constant $\epsilon > 0$, which is also common in the external-memory literature (e.g., see ||9l|TT]|5l|7l[T2j[T9l). For
example, if we take $\epsilon = 1/2$ and Alice wishes to sort an array of up to $2^{67}$ 64-bit words (that is, a zettabyte
of data), then, to use our algorithms, she would need a block size of at least 8 words (or 64 bytes) and a
private cache of at least 23 words (or 184 bytes). In addition, we assume throughout that keys and values
can be stored in memory words or blocks of memory words, which support the operations of read, write, 
copy, compare, add, and subtract, as in the standard RAM model. 
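
The numbers in the zettabyte example can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: logarithms are taken base 2, both assumptions are read as non-strict inequalities (as "at least 8 words" suggests), the array holds 2^67 words (a zettabyte, 2^70 bytes, of 64-bit words), and the helper names are hypothetical.

```python
import math

def min_block_size(n_words, eps):
    """Smallest B with B >= log2(N/B)**eps (the wide-block assumption)."""
    b = 1
    while b < math.log2(n_words / b) ** eps:
        b += 1
    return b

def min_cache_size(b, eps):
    """Smallest integer M with M >= B**(1 + eps) (the weak tall-cache assumption)."""
    return math.ceil(b ** (1 + eps))

N = 2 ** 67                      # assumed: 2**70 bytes of 8-byte words
B = min_block_size(N, 0.5)
M = min_cache_size(B, 0.5)
print(B, B * 8)                  # block size in words and in bytes
print(M, M * 8)                  # cache size in words and in bytes
```

With these assumptions the script yields a block of 8 words (64 bytes) and a cache of 23 words (184 bytes), matching the figures quoted in the text.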

2 Invertible Bloom Filters 

As mentioned above, one of the tools we use in our algorithms involves a data-oblivious use of the invertible 
Bloom lookup table of Goodrich and Mitzenmacher [25], which is itself based on the invertible Bloom filter
data structure of Eppstein and Goodrich [18].

An invertible Bloom lookup table, B, is a randomized data structure storing a set of key-value pairs. It 
supports the following operations (among others): 

• insert(x, y): insert the key-value pair, (x, y), into B. This operation always succeeds, assuming that 
keys are distinct. 

• delete(x, y): remove the key-value pair, (x, y), from B. This operation assumes (x, y) is in B. 

• get(x): lookup and return the value, y, associated with the key, x, in B. This operation may fail, with 
some probability. 

• listEntries: list all the key-value pairs being stored in B. With low probability, this operation may
return a partial list along with a "list-incomplete" error condition.

When an invertible Bloom lookup table B is first created, it initializes a table, T, of a specified
capacity, m, which is initially empty. The table T is the main storage used to implement B. Each of the
cells in T stores a constant number of fields, each of which is a single memory word or block. Insertions
and deletions can proceed independent of the capacity m and can even create situations where the number,
n, of key-value pairs in B can be much larger than m. Nevertheless, the space used for B remains $O(m)$.
The get and listEntries methods, on the other hand, only guarantee good probabilistic success when n < m. 

Like a traditional Bloom filter [8], an invertible Bloom lookup table uses a set of k random hash
functions, $h_1, h_2, \ldots, h_k$, defined on the universe of keys, to determine where items are stored. In this case, we
also assume that, for any x, the $h_i(x)$ values are distinct, which can be achieved by a number of methods,
including partitioning. Each cell contains three fields:

• a count field, which counts the number of entries that have been mapped to this cell, 

• a keySum field, which is the sum of all the keys that have been mapped to this cell,

• a valueSum field, which is the sum of all the values that have been mapped to this cell. 
Given these fields, which are all initially set to 0, performing the insert operation is fairly straightforward:

• insert(x, y):

    for each (distinct) h_i(x) value, for i = 1, . . . , k, do
        add 1 to T[h_i(x)].count
        add x to T[h_i(x)].keySum
        add y to T[h_i(x)].valueSum



It is a fairly straightforward exercise to implement the listEntries method in $O(m)$ time, say, by using a
linked-list-based priority queue of cells in T indexed by their count fields and modifying the delete method to
update this queue each time it deletes an entry from B. If, at the end of the while-loop in listEntries, all the
entries in T are empty, then we say that the method succeeded.

Lemma 1 (Goodrich and Mitzenmacher [25]): Given an invertible Bloom lookup table, T, of size
$m = \lceil \delta k n \rceil$, holding at most n key-value pairs, the listEntries method succeeds with probability
$1 - 1/n^c$, where c > 1 is any given constant and k > 2 and $\delta$ > 2 are constants depending on c.

The important observation about the functioning of the invertible Bloom lookup table, for the purposes 
of this paper, is that the sequence of memory locations accessed during an insert(x, y) method is oblivious to 
the value y and the number of items already stored in the table. That is, the locations accessed in performing 
an insert method depend only on the key, x. The listEntries method, on the other hand, is not oblivious to 
keys or values. 
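
To make the operations of this section concrete, here is a minimal sketch of an invertible Bloom lookup table with integer keys and values. It is an illustration under assumptions, not the implementation of [25]: the class name `IBLT`, the use of SHA-256 as a stand-in for random hash functions, and the `trace` list (which records the cells touched, i.e., what Bob would observe) are all choices made here. Distinct cells per key are obtained by partitioning the table into k sub-tables, and `list_entries` uses a simple repeated scan rather than the O(m)-time queue mentioned above.

```python
import hashlib

class IBLT:
    def __init__(self, cells_per_hash, k=3, salt=b""):
        self.k, self.s, self.salt = k, cells_per_hash, salt
        # Each cell is [count, keySum, valueSum], all initially 0.
        self.T = [[0, 0, 0] for _ in range(k * cells_per_hash)]
        self.trace = []                        # cells touched by insert/delete

    def _cells(self, x):
        # Partitioning: hash j maps into its own sub-table, so the k cells are distinct.
        out = []
        for j in range(self.k):
            d = hashlib.sha256(self.salt + bytes([j]) + str(x).encode()).digest()
            out.append(j * self.s + int.from_bytes(d[:8], "big") % self.s)
        return out

    def _update(self, x, y, sign):
        for c in self._cells(x):               # locations depend only on the key x
            self.trace.append(c)
            self.T[c][0] += sign
            self.T[c][1] += sign * x
            self.T[c][2] += sign * y

    def insert(self, x, y):
        self._update(x, y, +1)

    def delete(self, x, y):
        self._update(x, y, -1)

    def list_entries(self):
        """Destructive peeling; returns (entries, complete)."""
        out, progress = [], True
        while progress:
            progress = False
            for cell in self.T:
                if cell[0] == 1:               # a lone pair: its sums are the pair itself
                    x, y = cell[1], cell[2]
                    out.append((x, y))
                    self.delete(x, y)
                    progress = True
        return out, all(c == [0, 0, 0] for c in self.T)

t = IBLT(cells_per_hash=8)
t.insert(5, 50)
t.insert(9, 90)
t.delete(9, 90)
print(t.list_entries())
```

Because `_cells` ignores both the value and the current table contents, the cells recorded in `trace` during an insert are a function of the key alone, which is the property used in the compaction algorithms that follow.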

3 Data-Oblivious Compaction 

In this section, we describe efficient data-oblivious algorithms for compaction. In this problem, we are 
given an array, A, of cells, at most R of which are marked as "distinguished" (e.g., using a marked bit
or a simple test) and we want to produce an array, D, of size $O(R)$ that contains all the distinguished items
from A. Note that this is the fundamental operation done during disk defragmentation (e.g., see BH), which
is a natural operation that one would want to do in an outsourced file system, since users of such systems 
are charged for the space they use. 

We say that such a compaction algorithm is order-preserving if the distinguished items in A remain in 
their same relative order in D. In addition, we say that a compaction algorithm is tight if it compacts A to 
an array D of size exactly R. If the size of D is merely $O(R)$, and we allow for some empty cells in D,
then we say that the compaction is loose. Of course, we can always use a data-oblivious sorting algorithm, 
such as with the following sub-optimal result, to perform a tight order-preserving compaction. 

Lemma 2 (Goodrich and Mitzenmacher [24]): Given an array A of size N, one can sort A with a
deterministic data-oblivious algorithm that uses $O((N/B)\log^2_{M/B}(N/B))$ I/Os, assuming $B \ge 1$ and $M > 2B$.

In the remainder of this section, we give several compaction algorithms, which exhibit various trade-offs 
between performance and the compaction properties listed above. Incidentally, each of these algorithms is 
used in at least one of our algorithms for selection, quantiles, and sorting. 

The delete method of the invertible Bloom lookup table is basically the reverse of the insert method. The
listEntries method is similarly simple:

• listEntries:

    while there is an i ∈ [1, m] s.t. T[i].count = 1 do
        output (T[i].keySum, T[i].valueSum)
        call delete(T[i].keySum, T[i].valueSum)

We describe the listEntries method in a destructive fashion; if one wants a non-destructive method, then one
should first create a copy of the table T as a backup.
Data Consolidation. Each of our compaction algorithms uses a data-oblivious consolidation
preprocessing operation. In this operation, we are given an array A of size N such that at most R < N elements
in A are marked as "distinguished." The output of this step is an array, A', of $\lceil N/B \rceil$ blocks, such that
$\lfloor R/B \rfloor$ blocks in A' are completely full of distinguished elements and at most one block in A' is partially
full of distinguished elements. All other blocks in A' are completely empty of distinguished elements. This
consolidation step uses $O(N/B)$ I/Os, assuming only that $B \ge 1$ and $M > 2B$, and it is order-preserving
with respect to the distinguished elements in A.

We start by viewing A as being subdivided into $\lceil N/B \rceil$ blocks of size B each (it is probably already
stored this way in Bob's memory). We then scan the blocks of A from beginning to end, keeping a block,
x, in Alice's memory as we go. Initially, we just read in the first block of A and let x be this block. Each
time after this that we scan a block, y, of A, from Bob's memory, if the distinguished elements from x
and y in Alice's memory can form a full block, then we write out to Bob's memory a full block to A' of
the first B distinguished elements in $x \cup y$, maintaining the relative order of these distinguished elements.
Otherwise, we merge the fewer than B distinguished elements in Alice's memory into the single block, x, 
again, maintaining their relative order, and we write out an empty block to A' in Bob's memory. We then 
repeat. When we are done scanning A, we write out x to Bob's memory; hence, the access pattern for this 
method is a simple scan of A and A', which is clearly data-oblivious. Thus, we have the following. 

Lemma 3: Suppose we are given an array A of size N, with at most R of its elements marked as
distinguished. We can deterministically consolidate A into an array A' of $\lceil N/B \rceil$ blocks of size B each with a
data-oblivious algorithm that uses $\lceil N/B \rceil$ I/Os, such that all but possibly the last block in A' are completely
full of distinguished elements or completely empty of distinguished elements. This computation assumes
that $B \ge 1$ and $M > 2B$, and it preserves the relative order of distinguished elements.
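
The consolidation scan can be sketched as follows. This is a simulation under assumptions made here for illustration: non-distinguished cells are represented by `None`, the array is non-empty, the function name `consolidate` is hypothetical, and the returned trace simply logs which block is read or written at each step.

```python
def consolidate(A, B, is_distinguished=lambda v: v is not None, empty=None):
    """One scan over the blocks of A; returns (A_prime, trace)."""
    blocks = [A[i:i + B] for i in range(0, len(A), B)]
    trace = [("read", 0)]
    x = [v for v in blocks[0] if is_distinguished(v)]   # buffer held in Alice's cache
    out = []
    for t in range(1, len(blocks)):
        trace.append(("read", t))
        x += [v for v in blocks[t] if is_distinguished(v)]
        if len(x) >= B:
            out.append(x[:B])                 # write a full block
            x = x[B:]
        else:
            out.append([empty] * B)           # write an empty block
        trace.append(("write", t - 1))
    out.append(x + [empty] * (B - len(x)))    # final, possibly partial, block
    trace.append(("write", len(blocks) - 1))
    return out, trace

A = [1, None, 2, None, None, 3, 4, 5, None, 6, None, None]
A_prime, trace = consolidate(A, B=3)
print(A_prime)
```

The buffer never holds more than B distinguished elements after a write, so two blocks of cache suffice, and the trace is the same for every input with the same number of blocks.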

Tight Order-Preserving Compaction for Sparse Arrays. The semi-oblivious property of the invert- 
ible Bloom lookup table allows us to efficiently perform tight order-preserving compaction in a data- 
oblivious fashion, provided the input array is sufficiently sparse. Given the consolidation preprocessing 
step (Lemma 3), we describe this result in the RAM model (which could be applied to blocks that are
viewed as memory words for the external-memory model), with n = N/B and r = R/B. 

Theorem 4: Suppose we are given an array, A, of size n, holding at most r < n distinguished items.
We can perform a tight order-preserving compaction of the distinguished items from A into an array D of
size r in a data-oblivious fashion in the RAM model in $O(n + r \log^2 r)$ time. This method succeeds with
probability $1 - 1/r^c$, for any given constant c > 1.

Proof: We create an invertible Bloom lookup table, T, of size $\delta r$. Then we map each entry, A[i], into T
using the key-value pair (i, A[i]). Note that the insertion algorithm begins by reading k cells of T whose
locations in this case depend only on i (recall that k is the number of hash functions). If the entry A[i] is
distinguished, then this operation involves changing the fields of these cells in T according to the invertible
Bloom lookup table insertion method and then writing them back to T. If, on the other hand, A[i] is not
distinguished, then we return the fields back to each cell $T[h_j(i)]$ unchanged (but re-encrypted so that Bob
cannot tell if the cells were changed or not). Since the memory accesses in T are the same independent of
whether A[i] is distinguished or not, each key we use is merely an index i, and insertion into an invertible
Bloom lookup table is oblivious to all factors other than the key, this insertion algorithm is data-oblivious.
Given the output from this insertion phase, which is the table T, of size $O(r)$, we then perform a RAM
simulation of the listEntries method to list out the distinguished items in T, using the simulation result of
Goodrich and Mitzenmacher [24]. The simulation algorithm performs the actions of a RAM computation
in a data-oblivious fashion such that each step of the RAM computation has an amortized time overhead of
$O(\log^2 r)$. Thus, simulating the listEntries method in a data-oblivious fashion takes $O(r \log^2 r)$ time and
succeeds with probability $1 - 1/r^c$, for any given constant c > 0, by Lemma 1. The size of the output array,
D, is exactly r. Note that, at this point, the items in the array D are not necessarily in the same relative order
that they were in A. So, to make this compaction operation order-preserving, we complete the algorithm
by performing a data-oblivious sorting of D, using each item's original position in A as its key, say by
Lemma 2. ■

Thus, we can perform tight data-oblivious compaction for sparse arrays, where r is $O(n/\log^2 n)$, in
linear time. Performing this method on an array that is not sparse, however, would result in an $O(n \log^2 n)$
running time. Nevertheless, we can perform such an action on a dense array faster than this.



Tight Order-Preserving Compaction. Let us now show how to perform a tight order-preserving
compaction for a dense array in a data-oblivious fashion using $O((N/B)\log_{M/B}(N/B))$ I/Os. We begin by
performing a data consolidation preprocessing operation (Lemma 3). This allows us to describe our
compaction algorithm at the block level, for an array, A, of n blocks, where $n = \lceil N/B \rceil$, and a private memory
of size $m = O(M/B)$.

Let us define a routing network, $\mathcal{R}$, for performing this action, in such a way that $\mathcal{R}$ is somewhat like a
butterfly network (e.g., see [29]). There are $\lceil \log n \rceil$ levels, $L_0, L_1, \ldots$, to this network, with each level, $L_i$,
consisting of n cells (corresponding to the positions in A). Cell j of level $L_i$ is connected to cell j and cell
$j - 2^i$ of level $L_{i+1}$. (See Figure 1.)

Figure 1: A butterfly-like compaction network. Occupied cells are shaded and labeled with the remaining
distance that the block for that cell needs to move to the left. In addition, a block of m cells on level $L_0$ is
highlighted to show that its destination on a level $O(\log m)$ away is of size roughly m/2.



Initially, each occupied cell on level $L_0$ is labeled with the number of cells that it needs to be moved to
the left to create a tight compaction. Note that such a labeling can easily be produced by a single left-to-right
data-oblivious scan of the array A. For each occupied cell j on level $L_i$, labeled with distance label, $d_j$, we
route the contents of cell j to cell $j - (d_j \bmod 2^{i+1})$, which, by a simple inductive argument, is either cell
j or $j - 2^i$ on level $L_{i+1}$. We then update the distance label for this cell to be $d_j \leftarrow d_j - (d_j \bmod 2^{i+1})$,
and continue.

Lemma 5: If we start with a valid set of distance labels on level $L_0$, there will be no collisions at any
internal level of the network $\mathcal{R}$.

Proof: Notice that we can cause a collision between two cells, j and k, in going from level i to level i + 1,
only if $k = j + 2^i$ and $(d_j \bmod 2^{i+1}) = 0$ and $(d_k \bmod 2^{i+1}) = 2^i$. So suppose such a pair of colliding
cells exists. Then $(d_k - d_j) \bmod 2^{i+1} = 2^i$. Also, note that there are $d_k - d_j$ empty cells between k and
j, by the definition of these distance labels. Thus, there are at least $2^i$ empty cells between j and k; hence,
$k \ge j + 2^i + 1$, which is a contradiction. So no such pair of colliding cells exists. ■
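
The level-by-level routing rule and the no-collision property of Lemma 5 can be exercised with a small in-memory simulation. The sketch below makes several assumptions for illustration: positions are 0-indexed, empty cells are `None`, the function name `butterfly_compact` is hypothetical, and only the routing logic is modeled (a data-oblivious implementation would touch both candidate destination cells for every j, occupied or not).

```python
def butterfly_compact(A, empty=None):
    """Route occupied cells left through ceil(log2 n) levels; returns the compacted array."""
    n = len(A)
    # Distance label = number of empty cells to the left (one left-to-right scan).
    cells, gaps = [], 0
    for v in A:
        if v is empty:
            gaps += 1
            cells.append(None)
        else:
            cells.append((v, gaps))
    levels = (n - 1).bit_length() if n > 1 else 0
    for i in range(levels):
        nxt = [None] * n
        for j, c in enumerate(cells):
            if c is None:
                continue
            v, d = c
            step = d % (2 ** (i + 1))          # either 0 or 2**i at level i
            assert nxt[j - step] is None, "collision"
            nxt[j - step] = (v, d - step)      # updated remaining distance
        cells = nxt
    return [c[0] if c is not None else empty for c in cells]

print(butterfly_compact([None, 'a', None, None, 'b', 'c', None, 'd']))
```

On this input the labels are 1, 3, 3, and 4; after three levels the four items occupy the first four cells in their original order, and the internal assertion never fires.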

Thus, we can route the elements of A to their final destinations in $O(n \log n)$ scans. We can, in fact,
route elements faster than this, by considering cells $2m - 1$ at a time, starting at a cell $jm + 1$ on level
$L_0$. In this case, there are m possible destination cells at level $L_l$ with index at least $jm + 1$, where
$l = i + \log m - 1$. So we can route all these m cells with destinations among these cells on level $L_l$
in internal memory with a single scan of these cells. We can then move this "window" of $2m - 1$ cells
to the right by m cells and repeat this routing. At the level $L_l$ we note that these consecutive m cells are
now independent of one another, and the connection pattern of the previous $\log m - 1$ levels will be repeated
for cells that are at distance m apart. Thus, we can repeat the same routing for level $L_l$ (after a simple
shuffle that brings together cells that are m apart). Therefore, we can route the cells from $L_0$ to level $L_{\log n}$
using $O(n \log n / \log m) = O((N/B)\log_{M/B}(N/B))$ I/Os. Moreover, since we are simulating a circuit,
the sequence of I/Os is data-oblivious.

Theorem 6: Given an array A of N cells such that at most R < N of the cells in A are distinguished, we
can deterministically perform a tight order-preserving compaction of A in a data-oblivious fashion using
$O((N/B)\log_{M/B}(N/B))$ I/Os, assuming $B \ge 1$ and $M > 3B$.

Note that we can also use this method "in reverse" to expand any compact array to a larger array in an 
order-preserving way. In this case, each element would be given an expansion factor, which would be the 
number of cells it should be moved to the right (with these factors forming a non-decreasing sequence). 

Loose Compaction. Suppose we are now given an array A of size N, such that at most R < N/4 of the 
elements in A are marked as "distinguished" and we want to map the distinguished elements from A into an 
array D, of size 5R using an algorithm that is data-oblivious. 

We begin by performing the data consolidation algorithm of Lemma 3 to consolidate the distinguished
elements of A to be full or empty block-sized cells (save the last one), where we consider a cell "empty" if
it stores a null value that is different from any input value. So let us view A as a set of $n = \lceil N/B \rceil$ cells and
let us create an array, C, of size 4r block-sized cells, for $r = \lceil R/B \rceil$.

Define an A-to-C thinning pass to be a sequential scan of A such that, for each A[i], we choose an index
$j \in [1, 4r]$ uniformly at random, read the cell C[j], and if it's empty, A[i] is distinguished, and we have
yet to successfully write A[i] to C (which can be indicated with a simple bit associated with A[i]), then we
write A[i] to the cell C[j]. Otherwise, if C[j] is nonempty or A[i] is not distinguished, then we write the
old value of C[j] back to the cell C[j]. Thus, the memory accesses made by an A-to-C thinning pass are
data-oblivious and the time to perform such a pass is $O(n)$.
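
A single thinning pass can be sketched as below. The names `thinning_pass` and `placed` are hypothetical, empty cells are represented by `None`, and Python's `random.Random` stands in for Alice's private source of randomness; the write-back of an unchanged cell would be a re-encryption in the outsourced setting.

```python
import random

def thinning_pass(A, C, placed, rng, empty=None):
    """One A-to-C thinning pass; mutates C and placed, returns the C indices touched."""
    trace = []
    for i, item in enumerate(A):
        j = rng.randrange(len(C))             # uniform choice, independent of the data
        trace.append(j)
        old = C[j]                            # read C[j]
        if old is empty and item is not empty and not placed[i]:
            C[j] = item                       # write A[i] into the empty cell
            placed[i] = True                  # the "successfully written" bit
        else:
            C[j] = old                        # write the old value back
    return trace

A = [None, 'a', None, None, 'b', None, None, None]
C = [None] * 8                                # 4r cells for r = 2
placed = [False] * len(A)
rng = random.Random(0)
for _ in range(3):                            # a constant number of rounds
    thinning_pass(A, C, placed, rng)
print(C, placed)
```

Each pass reads and writes exactly one cell of C per cell of A, so the sequence of touched indices is determined by the random choices alone and not by which cells of A are distinguished.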

We continue our algorithm by performing $c_0$ rounds of A-to-C thinning passes, where $c_0$ is a constant
determined in the analysis, so the probability that any block of A is unsuccessfully copied into C is at most
$1/2^{2c_0}$. We then consider regions of A of size $\lceil c_1 \log n \rceil = O(\log(N/B))$, where $c_1$ is another constant
determined in the analysis, so that each such region has at most $(c_1 \log n)/2$ occupied cells w.v.h.p. In
particular, we assign these values according to the following.

Lemma 7: Consider a region of $c_1 \log(N/B)$ blocks of A, such that each block is unsuccessfully copied
into C independently with probability at most $1/2^{2c_0}$. The number of blocks of A that still remain in this
region is over $(c_1/2)\log(N/B)$ with probability at most $(N/B)^{-c_1}$, provided $c_0 > 3$.

Proof: The expected number of blocks that still remain is at most $(c_1/2^{2c_0})\log(N/B)$. Thus, if we let X
denote the number of blocks that still remain, then, by a Chernoff bound (e.g., Lemma 22 from the appendix)
and the fact that $c_0 > 3$, we have the following:

$$\Pr\left(X > (c_1/2)\log(N/B)\right) < 2^{-(c_1/2)\log(N/B)\,(2c_0 - 3)} \le 2^{-c_1 \log(N/B)}.$$

■

So this lemma gives us a lower bound for $c_0$. In addition, if we take $c_1 = d + 2$, then the probability that
there is any of our regions of size $c_1 \log(N/B)$ blocks with more than $c_1 \log(N/B)/2$ blocks still remaining
is at most $(N/B)^{-(d+1)}$, which gives us a high probability of success.

Since $c_1$ is a constant and the number of blocks we can fit in memory is
$m = M/B \ge \log^{\epsilon^2}(N/B) = \log^{\epsilon^2} n$, by our wide-block and tall-cache assumptions, we can apply the
data-oblivious sorting algorithm of Lemma 2 to sort each of the $O(n/\log n)$ regions using $O(\log n)$ I/Os apiece,
putting every distinguished element before any unmarked elements. Thus, the overall number of I/Os for such a
step is linear. Then, we compact each such region to its first $(c_1/2)\log n$ blocks. This action therefore halves
the size of the array, A. So we then repeat the above actions for the smaller array. We continue these reductions
until the number of the remaining set of blocks is less than $n/\log^2 n$, at which point we completely compress the
remaining array A by using the data-oblivious sorting algorithm of Lemma 2 on the entire array. This results in
an array of size at most r, which we concatenate to the array, C, of size 4r, to produce a compacted array of
size 5r. Therefore, we have the following.

Theorem 8: Given an array, A, of size N, such that at most R < N/4 of A's elements are marked as
distinguished, we can compress the distinguished elements in A into an array D of size 5R using a
data-oblivious algorithm that uses $O(N/B)$ I/Os and succeeds with probability at least $1 - 1/(N/B)^d$, for any
given constant d > 1, assuming $B \ge \log^{\epsilon}(N/B)$ and $M \ge B^{1+\epsilon}$, for some constant $\epsilon > 0$.

Proof: To establish the claim on the number of I/Os for this algorithm, note that the I/Os needed for all the
iterations are proportional to the geometric sum,

$$n + n/2 + n/4 + \cdots + n/\log^2 n,$$

which is $O(n) = O(N/B)$. In addition, when we complete the compression using the deterministic sorting
algorithm of Lemma 2, we use

$$O((n/\log^2 n) \log^2(n/\log^2 n)) = O(n) = O(N/B)$$

I/Os, under our weak wide-block and tall-cache assumptions. Thus, the entire algorithm uses $O(N/B)$ I/Os
and succeeds with very high probability. ■
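
The geometric sum in this proof can be checked numerically. The following is a minimal sketch, not part of the algorithm itself: the function name `halving_schedule` and the choice of base-2 logarithms are assumptions made for illustration, and n is assumed to be at least 2.

```python
import math

def halving_schedule(n: int) -> list[int]:
    """Array sizes (in blocks) seen by successive region-compaction rounds.

    Each round halves the array; rounds stop once fewer than n / log2(n)^2
    blocks remain, at which point a single oblivious sort finishes the job.
    """
    threshold = n / (math.log2(n) ** 2)
    sizes = []
    size = n
    while size >= threshold:
        sizes.append(size)
        size //= 2
    return sizes

n = 1 << 20
sizes = halving_schedule(n)
total_scanned = sum(sizes)   # proportional to the I/Os spent over all rounds
```

Because each round works on an array half the size of the previous one, `total_scanned` stays below 2n regardless of how many rounds are performed.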

Incidentally, we can also show that it is also possible to perform loose compaction without the wide- 
block and tall-cache assumptions, albeit with a slightly super-linear number of I/Os. Specifically, we prove 
the following in the full version of this paper. 

Theorem 9: Given an array, A, of size N, such that at most R < N/4 of A's elements are marked as
distinguished, we can compress the distinguished elements in A into an array B of size 4.25R using a
data-oblivious algorithm that uses $O((N/B) \log^*(N/B))$ I/Os and succeeds with probability at least
$1 - 1/(N/B)^d$, for any given constant $d \ge 1$, assuming only that $B \ge 1$ and $M \ge 2B$.

4 Selection and Quantiles



Selection. Suppose we are given an array A of n comparable items and want to find the kth smallest
element in A. For each element $a_i$ in A, we mark $a_i$ as distinguished with probability $1/n^{1/2}$. Assuming
that there are at most $n^{1/2} + n^{3/8}$ distinguished items, we then compress all the distinguished items in A to
an array, C, of size $n^{1/2} + n^{3/8}$ using the method of Theorem 4, which runs in $O(n)$ time in this case and,
as we prove, succeeds with high probability. We then sort the items in C using a data-oblivious algorithm,
which can be done in $O(n^{1/2} \log^2 n)$ time by Lemma 2, considering empty cells as holding $+\infty$. We then
scan this sorted array, C. During this scan, we save in our internal registers, items, x' and y', with ranks
$\lceil k/n^{1/2} - n^{3/8} \rceil$ and $|C| - \lceil (n-k)/n^{1/2} - 2n^{3/8} \rceil$, respectively, in C, if they exist. If x' (resp., y') does
not exist, then we set $x' = -\infty$ (resp., $y' = +\infty$). We then scan A to find x'' and y'', the smallest and
largest elements in A, respectively. Then we set $x = \max\{x', x''\}$ and $y = \min\{y', y''\}$. We show below
that, w.v.h.p., the kth smallest element is contained in the range [x, y] and there are $O(n^{7/8})$ items of A in
this range.
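
The sampling scheme just described can be simulated in ordinary, non-oblivious Python to see how the bracket [x, y] is used. This is only a sketch under stated assumptions: the function name `sample_and_select` is hypothetical, plain `sorted` calls stand in for the data-oblivious compaction and sorting steps, and a failed bracket is reported as `None` rather than handled.

```python
import math
import random

def sample_and_select(A, k, rng):
    """Return the k-th smallest item of A (1-indexed), or None on failure.

    Plain (non-oblivious) simulation of the sampling scheme: sample, bracket
    the answer with [x, y], then search only the bracketed items.
    """
    n = len(A)
    root, dev = n ** 0.5, n ** 0.375          # n^(1/2) and n^(3/8)
    C = sorted(a for a in A if rng.random() < 1.0 / root)

    lo_rank = math.ceil(k / root - dev)
    hi_rank = len(C) - math.ceil((n - k) / root - 2 * dev)
    x1 = C[lo_rank - 1] if 1 <= lo_rank <= len(C) else -math.inf
    y1 = C[hi_rank - 1] if 1 <= hi_rank <= len(C) else math.inf
    x, y = max(x1, min(A)), min(y1, max(A))

    below = sum(1 for a in A if a < x)        # items strictly left of the bracket
    D = sorted(a for a in A if x <= a <= y)   # stands in for compaction + oblivious sort
    idx = k - 1 - below
    return D[idx] if 0 <= idx < len(D) else None

rng = random.Random(7)
A = rng.sample(range(10**6), 20000)
results = {k: sample_and_select(A, k, rng) for k in (1, 5000, 20000)}
```

Whenever the index into D is valid, the returned item has exactly `below` smaller items plus its offset in D, so it is the kth smallest; the probabilistic analysis only concerns how often the bracket misses and how large D becomes.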

Lemma 10: There are more than $n^{1/2} + n^{3/8}$ elements in C with probability at most $e^{-n^{1/4}/3}$, and fewer
than $n^{1/2} - n^{3/8}$ elements in C with probability at most $e^{-n^{1/4}/2}$.

Proof: Let X denote the number of elements of A chosen to belong to C. Noting that each element of A is
chosen independently with probability $n^{-1/2}$ to belong to C, by a standard Chernoff bound (e.g., see [34],
Theorem 4.4),

$$\Pr(X > n^{1/2} + n^{3/8}) = \Pr(X > (1 + n^{-1/8}) n^{1/2}) \le e^{-n^{1/4}/3},$$

since $E(X) = n^{1/2}$. Also, by another standard Chernoff bound (e.g., see [34], Theorem 4.5),

$$\Pr(X < n^{1/2} - n^{3/8}) = \Pr(X < (1 - n^{-1/8}) n^{1/2}) \le e^{-n^{1/4}/2}.$$

■

Recall that we then performed a scan, where we save in our internal registers, items, x' and y', with
ranks $\lceil k/n^{1/2} - n^{3/8} \rceil$ and $|C| - \lceil (n-k)/n^{1/2} - 2n^{3/8} \rceil$, respectively, in C, if they exist. If x' (resp., y')
does not exist, then we set $x' = -\infty$ (resp., $y' = +\infty$). We then scan A to find x'' and y'', the smallest and
largest elements in A, respectively. Then we set $x = \max\{x', x''\}$ and $y = \min\{y', y''\}$.

Lemma 11: With probability at least $1 - 2e^{-n^{3/8}/9} - e^{-n^{1/4}/3} - e^{-n^{1/4}/2} - e^{-4n^{3/8}/5}$, the kth smallest
element of A is contained in the range [x, y] and there are at most $8n^{7/8}$ items of A in the range [x, y].

Proof: Let us first suppose that $k < 2n^{7/8}$. In this case, we pick the smallest element in A for x, in which
case $k \ge x$.

So, let us consider the remaining possibility that $x > k$, for $k \ge 2n^{7/8}$. We can model the number of
elements of A less than or equal to x as the sum, X, of $k' = k/n^{1/2} - n^{3/8}$ independent geometric random
variables with parameter $p = n^{-1/2}$. The probability that k is less than or equal to x is bounded by

$$\Pr(X > k) = \Pr(X > (n^{1/2} + t)k'),$$

where $t = n^{7/8}/(k/n^{1/2} - n^{3/8})$; hence, $t \ge n^{3/8}$ and $k' \ge n^{3/8}$. If $t/n^{1/2} \le 1/2$, then, by a Chernoff
bound (e.g., Lemma 23 from the appendix), we have the following:

$$\Pr(X > k) \le e^{-(t/n^{1/2})^2 k'/3}.$$

Similarly, if $t/n^{1/2} > 1/2$, then

$$\Pr(X > k) \le e^{-(t/n^{1/2}) k'/9} \le e^{-n^{3/8}/9}.$$

Thus, in either case, the probability that the kth smallest element is less than x is bounded by this latter
probability. By a symmetric argument, assuming $|C| \ge n^{1/2} - n^{3/8}$, the probability that there are more
than $(n - k)$ elements of A greater than y, for $k \le n - 2n^{7/8}$, is also bounded by this latter probability (note
that, for $k > n - 2n^{7/8}$, it is trivially true that the kth smallest element is less than or equal to y). So, with
probability at least $1 - 2e^{-n^{3/8}/9} - e^{-n^{1/4}/2}$, the kth smallest element is contained in the range [x, y].

Let us next consider the number of elements of A in the range [x, y]. First, note that, with probability at least
$1 - e^{-n^{1/4}/3}$, there are at most $n' = 4n^{3/8}$ elements of the random sample, C, in this range; hence, we can
model the number of elements of A in this range as being bounded by the sum, Y, of n' geometric random
variables with parameter $p = 1/n^{1/2}$. Thus, by a Chernoff bound (e.g., Lemma 23 from the appendix), we
have the following:

$$\Pr(Y > 8n^{7/8}) = \Pr(Y > (n^{1/2} + n^{1/2}) n') \le e^{-4n^{3/8}/5}.$$

This gives us the lemma. ■

So, we make an additional scan of A to mark the elements in A that are in the range [x, y], and we then
compress these items to an array, D, of size $O(n^{7/8})$, using the method of Theorem 4, which runs in $O(n)$
time in this case. In addition, we can determine the rank, r(x), of x in A. Thus, we have just reduced the
problem to returning the item in D with rank $k - r(x) + 1$. We can solve this problem by sorting D using
the oblivious sorting method of Lemma 2, followed by a scan to obliviously select the item with this rank.
Therefore, we have the following:

Theorem 12: Given an integer, $1 \le k \le n$, and an array, A, of n comparable items, we can select the kth
smallest element in A in $O(n)$ time using a data-oblivious algorithm that succeeds with probability at least
$1 - n^{-d}$, for any given constant $d > 0$.

Note that the running time (with a very high success probability) of this method beats the $\Omega(n \log \log n)$
lower bound of Leighton et al. [30], which applies to any high-success-probability randomized data-oblivious
algorithm based on the exclusive use of compare-exchange as the primitive data-manipulation operation.
Our method is data-oblivious, but it also uses primitive operations of copying, summation, and random
hashing. Thus, Theorem 12 demonstrates the power of using these primitives in data-oblivious algorithms.
In addition, by substituting data-oblivious external-memory compaction and sorting steps for the
internal-memory methods used above, we get the following:

Theorem 13: Given an integer, $1 \le k \le N$, and an array, A, of N comparable items, we can select the kth
smallest element in A using $O(N/B)$ I/Os with a data-oblivious external-memory algorithm that succeeds
with probability at least $1 - (N/B)^{-d}$, for any given constant $d > 0$, assuming only that $B \ge 1$ and
$M \ge 2B$.

Quantiles. Let us now consider the problem of selecting q quantile elements from an array A using an
external-memory data-oblivious algorithm, for the case when q < {M/BY/^, which will be sufficient for 
this algorithm to prove useful for our external-memory sorting algorithm, which we describe in Section 5.

If {M/B) > {N/BY''^, then we sort A using the deterministic data-oblivious algorithm of Lemma pi 
which uses 0{N/ B) I/Os in this case. Then we simply read out the elements at ranks that are multiples of 
N/{q + 1), rounded to integer ranks. 
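
Reading the quantiles out of a sorted array, as in the large-memory case above, amounts to a fixed set of rank lookups. The sketch below is illustrative only: the helper names are hypothetical, ranks are taken to be 1-indexed, and Python's built-in `round` is assumed as the rounding rule since the text does not specify one.

```python
def quantile_ranks(N: int, q: int) -> list[int]:
    """1-indexed ranks of the q quantiles: multiples of N/(q+1), rounded."""
    return [round(i * N / (q + 1)) for i in range(1, q + 1)]

def quantiles_of_sorted(sorted_items, q):
    """Read the quantile elements out of an already sorted array."""
    N = len(sorted_items)
    # max(r, 1) guards against a rank of 0 when N is smaller than q + 1.
    return [sorted_items[max(r, 1) - 1] for r in quantile_ranks(N, q)]

ranks = quantile_ranks(100, 3)
picked = quantiles_of_sorted(list(range(1, 101)), 3)
```

The ranks depend only on N and q, so the positions read during this step reveal nothing about the data values.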

Let us therefore suppose instead that {M/B) < {N/By/'^; so q < {N/Bf''^^. In this case, we 
randomly choose each element of A to belong to a random subset, C, with probability $N^{-1/4}$. With high
probability, there are at most $N^{3/4} + N^{1/2}$ such elements, so we can compact them into an array, C, of
capacity $N^{3/4} + N^{1/2}$ by Theorem 4 using $O(N/B)$ I/Os. We then sort this array and compact it down to
capacity $N^{3/4} + N^{1/2}$, using Lemma 2, and let |C| denote its actual size (which we remember in private
memory). We then scan this sorted array, C, and read into Alice's memory every item, $x_i$, with rank
$i \cdot n/(q+1) - N^{1/2}$, rounded to integer ranks, where $n = N^{3/4}$, with the exception that $x_1$
is the smallest element in A. We also select items at ranks $y_i = |C| - (N^{3/4} - N^{3/4} i/(q+1) - 2N^{1/2})$,
rounded to integer ranks, with the exception that $y_q$ is the largest element in A. Let $[x_i, y_i]$ denote each such
pair of such items, which, as we show below, will surround a value, $n/(q+1)$, with high probability. In
addition, as we also show in the analysis, there are at most $8N^{3/4}$ elements of A in each interval $[x_i, y_i]$,
with high probability. That is, there are at most 0{N^^/^^) elements of A in any interval [xi,yi], with high 
probability. Storing all the $[x_i, y_i]$ intervals in Alice's memory, we scan A to identify for each element in
A if it is contained in such an interval $[x_i, y_i]$, marking it with i in this case, or if it is outside every such
interval. During this scan we also maintain counts (in Alice's memory) of how many elements of A fall
between each consecutive pair of intervals, $[x_i, y_i]$ and $[x_{i+1}, y_{i+1}]$, and how many elements fall inside each
interval $[x_i, y_i]$. We then compact all the elements of A that are inside such intervals into an array D of
size 0{N^^/^^) using Theorem [4j using 0{N/B) I/Os, and we pad this array with dummy elements so 
the number of elements of D in each interval $[x_i, y_i]$ is exactly $\lceil 8N^{3/4} \rceil$. We then sort D using the
data-oblivious method of Lemma 2. Next, for each subarray of D of size $\lceil 8N^{3/4} \rceil$ we use the selection algorithm
of Theorem 13 to select the $k_i$th smallest element in this subarray, where $k_i$ is the value that will return the
ith quantile for A (based on the counts we computed during our scans of A).

Lemma 14: The number of elements of A in C is more than $N^{3/4} + N^{1/2}$ with probability at most $e^{-N^{1/4}/3}$,
and the number of elements of A in C is less than $N^{3/4} - N^{1/2}$ with probability at most $e^{-N^{1/4}/2}$.

Proof: Let X denote the number of elements of A chosen to belong to C. Noting that each element of A is
chosen independently with probability $N^{-1/4}$ to belong to C, by a standard Chernoff bound (e.g., see [34],
Theorem 4.4),

$$\Pr(X > N^{3/4} + N^{1/2}) = \Pr(X > (1 + N^{-1/4}) N^{3/4}) \le e^{-N^{1/4}/3},$$

since $E(X) = N^{3/4}$. Also, by another standard Chernoff bound (e.g., see [34], Theorem 4.5),

$$\Pr(X < N^{3/4} - N^{1/2}) = \Pr(X < (1 - N^{-1/4}) N^{3/4}) \le e^{-N^{1/4}/2}.$$



We then compress these elements into an array, C, of size assumed to be at most $N^{3/4} + N^{1/2}$ by
TheoremkI using 0{N/ B) I/Os, which will fail with probability at most for any given constant 
c > 0. Given the array C, we select items at ranks $x_i = (n i/(q+1)) - N^{1/2}$, rounded to integer ranks,
where $n = N^{3/4}$. We also select items at ranks $y_i = |C| - (N^{3/4} - N^{3/4} i/(q+1) - 2N^{1/2})$, rounded to
integer ranks. Let $[x_i, y_i]$ denote each such pair, with the added convention that we take $x_1$ to be the smallest
element in A and $y_q$ to be the largest element in A.

Lemma 15: There are more than $8N^{3/4}$ elements of A in any interval $[x_i, y_i]$ with probability at most

Proof: Let us assume that $N^{3/4} - N^{1/2} \le |C| \le N^{3/4} + N^{1/2}$, which holds with probability at least
$1 - 2e^{-N^{1/4}/3}$. Thus, there are at most $4N^{1/2}$ elements in C that are in any $[x_i, y_i]$ pair, other than the first
or last. Thus, in such a general case, the number of elements, X, from A in this interval has expected value
$E(X) \le 4N^{3/4}$. In addition, since X is the sum of geometric random variables with parameter

The probability bounds for the first and last intervals are proved by a similar argument. ■ 

In addition, note that there are at most (N/B)^/^^ intervals, [xi, yi]. Also, we have the following. 

Lemma 16: Interval $[x_k, y_k]$ contains the kth quantile with probability at least $1 - 2e^{-N^{1/4}/3}$.

Proof: Let us consider the probability that the kth quantile is less than $x_k$. In the random sample, $x_k$ has
rank $nk/(q+1) - N^{1/2} = N^{3/4}k/(q+1) - N^{1/2}$. Thus, the number, X, of elements from A less than
this number has expected value $E(X) = Nk/(q+1) - N^{3/4}$. Since X is the sum of geometric random
variables with parameter $N^{-1/4}$, we can bound $\Pr(X > Nk/(q+1))$ by

$$\Pr\left(X > (N^{1/4} + \tau)(N^{3/4}k/(q+1) - N^{1/2})\right) \le e^{-(\tau N^{-1/4})^2 (N^{3/4}k/(q+1) - N^{1/2})/3} = e^{-\tau^2 N^{-1/2} (N^{3/4}k/(q+1) - N^{1/2})/3},$$

where $\tau = N^{3/4}/(N^{3/4}k/(q+1) - N^{1/2})$, which is greater than 1. So

$$\Pr(X > Nk/(q+1)) \le e^{-N^{1/4}/3}.$$

By a symmetric argument, the kth quantile is more than $y_k$ by this same probability. ■

Thus, each of the intervals, $[x_k, y_k]$, contains the kth quantile with probability at least $1 - 2qe^{-N^{1/4}/3}$.
Therefore, we have the following.

Theorem 17: Given an array. A, of N comparable items, we can select theq < (M/B)^/'*' quantiles in A 
using $O(N/B)$ I/Os with a data-oblivious external-memory algorithm that succeeds with probability at least
$1 - (N/B)^{-d}$, for any given constant $d > 0$, assuming only that $B \ge 1$ and $M \ge 2B$.

5 Data-Oblivious Sorting

Consider now the sorting problem, where we are given an array, A, of N comparable items stored as
key-value pairs and we want to output an array C of size N holding the items from A in key order. For inductive
reasons, however, we allow both A and C to be arrays of size $O(N)$ that have non-empty cells, with the
requirement that the items in non-empty cells in the output array C be given in non-decreasing order.



We begin by computing q quantiles for the items in A using Theoremhv^ for q = {M/B)^/'^. We then 
would like to distribute the items of A between all these quantiles to q + 1 subarrays of size
$O(N/q)$ each. Let us think of each subgroup (defined by the quantiles) in A as a separate color, so that each
item in A is given a specific color, 1, 2, ..., q + 1, with there being $\lceil N/(q+1) \rceil$ items of each color.



Multi-way Data Consolidation. To prepare for this distribution, we do a (q + 1)-way consolidation of A,
so that the items in each block of size B in the consolidated array A' are all of the same color. We perform
this action as follows. Read into Alice's memory the first q + 1 blocks of A, and separate them into q + 1
groups, one for each color. While there is a group of items with the same color of size at least B, output a
block of these items to A'. Once we have output q' such blocks, all the remaining colors in Alice's memory
have fewer than B members. So we then output $q + 1 - q'$ empty blocks to A', keeping the "left over" items
in Alice's memory. We then repeat this computation for the next q + 1 blocks of A, and the next, and so on.
When we complete the scan of A, all the blocks in A' will be completely full of items of the same color or
they will be completely empty. We finish this (q + 1)-way consolidation, then, by outputting (q + 1) blocks,
each containing as many items of the same color as possible. These last blocks are the only partially-full
blocks in A'. Thus, all the blocks of A' are monochromatic and all but these last blocks of A' are full.
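
The consolidation scan can be sketched in Python as follows. This is a simplified illustration rather than the paper's exact procedure: the function name `consolidate` and the `(color, item)` representation are assumptions, and, to keep the number of blocks written per round fixed, this sketch caps each round at q + 1 output blocks and carries any surplus full groups into the next round, finishing with a fixed-size flush of 2(q + 1) blocks. Those two choices are assumptions of the sketch and are not stated in the text.

```python
import random
from collections import defaultdict

def consolidate(blocks, num_colors, B):
    """Consolidate `blocks` (lists of (color, item) pairs, each of length <= B).

    Returns blocks that are each empty or hold items of a single color.
    Every round reads num_colors blocks and writes exactly num_colors blocks,
    so the I/O pattern depends only on len(blocks), num_colors, and B.
    """
    groups = defaultdict(list)          # color -> items buffered in private memory
    out = []
    for start in range(0, len(blocks), num_colors):
        for block in blocks[start:start + num_colors]:
            for color, item in block:
                groups[color].append(item)
        written = 0
        for color in range(1, num_colors + 1):
            while len(groups[color]) >= B and written < num_colors:
                out.append([(color, it) for it in groups[color][:B]])
                del groups[color][:B]
                written += 1
        out.extend([] for _ in range(num_colors - written))   # pad with empty blocks
    # Final flush, padded to a fixed 2 * num_colors blocks.
    tail = []
    for color in range(1, num_colors + 1):
        while groups[color]:
            tail.append([(color, it) for it in groups[color][:B]])
            del groups[color][:B]
    tail.extend([] for _ in range(2 * num_colors - len(tail)))
    return out + tail

rng = random.Random(1)
num_colors, B = 4, 3
items = [(rng.randint(1, num_colors), i) for i in range(60)]
blocks = [items[i:i + B] for i in range(0, len(items), B)]
consolidated = consolidate(blocks, num_colors, B)
```

In this variant the buffered items never exceed (q + 1)(B - 1) in total after a round, since a round that hits the output cap removes as many items as it read, which is why the final flush fits in fewer than 2(q + 1) blocks.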



Shuffle-and-Deal Data Distribution. Our remaining goal, then, is to distribute the (monochromatic)
blocks of A' to q + 1 separate arrays, $C_1, \ldots, C_{q+1}$, one for each color. Unfortunately, doing this as a
straightforward scan of A' may encounter the colors in a non-uniform fashion. To probabilistically avoid
this "hot spot" behavior, we apply a shuffle-and-deal technique, where we perform a random permutation
to the $n' = O(N/B)$ blocks in A' in a fashion somewhat reminiscent of Valiant-Brebner routing [43]. The
random permutation algorithm we use here is the well-known algorithm (e.g., see Knuth [28]), where, for
i = 1, 2, ..., n', we swap block i and a random block chosen uniformly from the range [i, n']. This is the
"shuffle," and even though the adversary, Bob, can see us perform this shuffle, note that the choices we make
do not depend on data values. Given this shuffled deck of blocks in A', we then perform a series of scans to
"deal" the blocks of A' to the q + 1 arrays.

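The shuffle step is the standard Knuth (Fisher-Yates) permutation applied at block granularity. A minimal sketch follows, with the function name `shuffle_blocks` and the use of Python's `random.Random` as the randomness source being assumptions for illustration.

```python
import random

def shuffle_blocks(blocks, rng):
    """In-place Knuth shuffle: swap position i with a uniform choice from [i, n-1].

    The swap targets come only from `rng`, never from block contents.
    """
    n = len(blocks)
    for i in range(n):
        j = rng.randrange(i, n)
        blocks[i], blocks[j] = blocks[j], blocks[i]
    return blocks

deck = shuffle_blocks(list(range(50)), random.Random(3))
```

An observer sees which pairs of positions are swapped, but those positions are a function of the random choices alone, which is the property the text relies on.
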
We do this "deal" as follows. We read in the next (M/B)'^/^ blocks of A'. Note that, w.v.h.p., there 
should now be at most $O((M/B)^{1/2})$ blocks of each color now in Alice's memory. So we write out
$c(M/B)^{1/2}$ blocks to each $C_i$ in Bob's memory, including as many full blocks as possible and padding
with empty blocks as needed (to keep accesses being data-oblivious), for a constant c determined in the
analysis. Then, we apply Theorem 8, which implies a success with high probability, to compact each $C_i$
to have size 0{N/q) = 0{N / {M / By/"^) each, which is 0{N/{qB)) blocks. We then repeat the above 
computation for each subarray, $C_i$.
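
One round of the "deal" can be sketched as below. The names `deal_batch`, `cap`, and the `(color, payload)` block representation are hypothetical; in the text the batch size and the per-color cap are specific powers of M/B, which this sketch simply takes as parameters.

```python
def deal_batch(batch, num_colors, cap):
    """Deal one batch of monochromatic blocks into per-color output runs.

    `batch` is a list of (color, payload) blocks. Each color receives exactly
    `cap` output slots, real blocks first and None padding after, so the
    number of writes per color never depends on the batch contents.
    Raises OverflowError if some color has more than `cap` blocks.
    """
    runs = {color: [] for color in range(1, num_colors + 1)}
    for color, payload in batch:
        runs[color].append(payload)
    for color, run in runs.items():
        if len(run) > cap:
            raise OverflowError(f"color {color} overflows its {cap} slots")
        run.extend([None] * (cap - len(run)))
    return runs

runs = deal_batch([(1, "a"), (2, "b"), (1, "c")], num_colors=3, cap=2)
```

The overflow case corresponds to the low-probability failure event that the analysis of the random permutation bounds; the padded runs are what a later compaction step shrinks.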

Data-Oblivious Failure Sweeping. We continue in this manner until we have formed $O(n^{1/2})$ subarrays
of size $O(n^{1/2})$ each, where n = N/B. At this point, we then recursively call our sorting algorithm to
produce a padded sorting of each of the subarrays. Of course, since we are recursively solving smaller
problems, whose success probability depends on their size, some of these may fail to correctly produce a
padded sorting of their inputs. Let us assume that at most $O(n^{1/4})$ of these recursive sorts fail, however, and
apply Theorem 6 to deterministically compact all of these subarrays into a single array, D, of size $O(n^{3/4})$,
in $O(n \log_m n)$ time, where m = M/B. We then apply the deterministic data-oblivious sorting method of
Lemma 2 to D, and then we perform a reversal of Theorem 6 to expand these sorted elements back to their
original subarrays. This provides a data-oblivious version of the failure sweeping technique [20] and gives
us a padded sorting of A w.v.h.p.

Finally, after we have completed the algorithm for producing a padded sorting of A, we perform a tight
order-preserving compaction for all of A using Theorem 6. Given appropriate probabilistic guarantees,
given below, this completes the algorithm.

Lemma 18: Given (M/B)^/^ blocks of A', read in from consecutive blocks from a random permutation, 
more than c{MjB')^l'^ of these blocks are of color, x, for a given color, with probability less than 
{N/B)^'^, for c > 2de/e^, where e is the constant used in the wide-block and tall-cache assumptions. 

Proof: Let X be the number of blocks among (M/B)'^/'^ blocks chosen independently without replacement 
from A' that are of color x and let Y be the number of blocks among (M / B)^/^ blocks chosen independently 
with replacement from A' that are of color x- By a theorem (4) of Hoeffding ||26]| . 



Pr(X > c{M/Bff'^) < Pr(y > c{M / Bf 



Thus, since E{Y) = (M/i3)^/^, then, by a Chemoff bound (e.g.. Lemma 22 from the appendix), 

Pr(y > c{M/Bfl'^) < 2-cWS)i/2 < ^n/bY, 

for c > 2de/€^. ■ 

This bound applies to any (M/B)^/'^ blocks we read in from A', and any specific color, x- Thus, we 
have the following. 

Corollary 19: For each set of (M/ B)^^^ blocks of A', read in from consecutive blocks from the constructed 
random permutation, more than c{M/B)^/'^ of these blocks are of any color, x, witti probability less than 
(N/B)^'^, for c > 2d!e/e^, where e is the constant used in the wide-block and tall-cache assumptions and 
d' >d+l. 

Proof: There are {N/B)/{M/Bf/* sets of {M/Bf/"^ blocks of A' read in from consecutive blocks from 
the constructed random permutation. In addition, there are (Af/i?)^/^^ colors. Thus, by the above lemma 
and the union bound, the probability that any of these sets overflow a color x is at most 



N/B 



{M/Bf/^ {N/BY 

ford' >d + l. 



{M/Bfl^^ < {N/BY 



So we write out $c(M/B)^{1/2}$ blocks to each $C_i$, including as many full blocks as possible and padding
with empty blocks as needed. We then repeat the quantile computation and shuffle-and-deal computation on
each of the $C_i$'s.

At the point when subproblem sizes become of size $O(n^{1/2})$, we then switch to a recursive computation,
which we claim inductively succeeds with probability $1 - 1/n^{d/2}$, for any given constant $d \ge 2$, for
n = N/B.



Lemma 20: There are more than $n^{1/4}$ failing recursive subproblems with probability at most $2^{-n^{1/4}}$.

Proof: Each of the recursive calls fails independently. Thus, if we let X denote the number of failing
recursive calls, then $E(X) \le n^{1/2}/n^{d/2} = n^{-(d-1)/2}$. Thus, by a Chernoff bound (e.g., Lemma 22 from
the appendix),

$$\Pr(X > n^{1/4}) = \Pr\left(X > n^{1/4 + (d-1)/2} \cdot n^{-(d-1)/2}\right) \le 2^{-n^{1/4}}.$$

■



Thus, the failure sweeping step in our sorting algorithm succeeds with high probability, which completes 
the analysis and gives us the following. 

Theorem 21: Given an array, A, of size N, we can perform a data-oblivious sorting of A with an algorithm
that succeeds with probability $1 - 1/(N/B)^d$ and uses $O((N/B) \log_{M/B}(N/B))$ I/Os, for any given
constant $d \ge 1$, assuming that $B \ge \log^{\epsilon}(N/B)$ and $M \ge B^{1+\epsilon}$, for some small constant $\epsilon > 0$.



6 Conclusion 

We have given several data-oblivious algorithms for fundamental combinatorial problems involving
outsourced data. It would be interesting to know if a linear-I/O data-oblivious algorithm is possible for data
compression without the wide-block and tall-cache assumptions (in Appendix B, we show how to relax this
wide-block assumption, albeit at the cost of an extra log-star factor in the I/O performance). It would also
be interesting to investigate data-oblivious algorithms for graph and geometric problems in this same model,
as such computations are well-motivated for large outsourced data sets.

A Some Chernoff Bounds



Several of our proofs make use of Chernoff bounds (e.g., see [4, 16, 34, 35, 36] for other examples), which,
for the sake of completeness, we review in this subsection. We begin with the following Chernoff bound,
which is a simplification of a well-known bound.

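Lemma 22: Let $X = X_1 + X_2 + \cdots + X_n$ be the sum of independent 0-1 random variables, such that
$X_i = 1$ with probability $p_i$, and let $\mu \ge E(X) = \sum_{i=1}^{n} p_i$. Then, for $\gamma \ge 2e$,

$$\Pr(X > \gamma\mu) < 2^{-\gamma\mu \log(\gamma/e)}.$$

Proof: By a standard Chernoff bound (e.g., see [4, 16, 34, 35, 36]) and the fact that $\gamma \ge 2e$,

$$\Pr(X > \gamma\mu) < \left(\frac{e^{\gamma-1}}{\gamma^{\gamma}}\right)^{\mu} \le \left(\frac{e}{\gamma}\right)^{\gamma\mu} = 2^{-\gamma\mu \log(\gamma/e)}.$$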



There are other simplified Chernoff bounds similar to that of Lemma 22 (e.g., see [4, 34, 35, 36]), but
they typically omit the $\log(\gamma/e)$ term (where the log is base-2, of course). We include it here, since it is
useful for large $\gamma$, which will be the case for some of our uses. Nevertheless, we sometimes leave off the
$\log(\gamma/e)$ term, as well, in applying Lemma 22 if that aids simplicity, since $\log(\gamma/e) \ge 1$ for $\gamma \ge 2e$.
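
As a sanity check, the bound of Lemma 22 can be compared against an exact binomial tail. The sketch below is only an illustration for one parameter choice; the helper names and the specific values of n, p, and gamma are assumptions, and all variables are taken to share the same success probability.

```python
import math

def binomial_upper_tail(n: int, p: float, threshold: float) -> float:
    """Exact Pr(X > threshold) for X ~ Binomial(n, p)."""
    start = math.floor(threshold) + 1
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(start, n + 1))

def lemma22_bound(gamma: float, mu: float) -> float:
    """The bound 2^(-gamma * mu * log2(gamma / e)), stated for gamma >= 2e."""
    return 2.0 ** (-gamma * mu * math.log2(gamma / math.e))

n, p = 200, 0.01            # mu = E(X) = 2
mu = n * p
gamma = 6.0                 # satisfies gamma >= 2e (about 5.44)
exact = binomial_upper_tail(n, p, gamma * mu)
bound = lemma22_bound(gamma, mu)
```

For these values the exact tail is several orders of magnitude below the bound, which is typical: the lemma trades tightness for a form that is easy to plug into union bounds.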

In addition, we also need a Chernoff bound for the sum, X, of n independent geometric random variables
with parameter p, that is, for X being a negative binomial random variable with parameters n and p. Recall
that a geometric random variable with parameter p is a discrete random variable that is equal to j with
probability $q^{j-1}p$, where $q = 1 - p$. Thus, $E(X) = \alpha n$, where $\alpha = 1/p$.

Lemma 23: Let $X = X_1 + X_2 + \cdots + X_n$ be the sum of n independent geometric random variables with
parameter p. Then we have the following:

• If $0 < t < \alpha/2$, then $\Pr(X > (\alpha + t)n) \le e^{-(tp)^2 n/3}$.

• If $t \ge \alpha/2$, then $\Pr(X > (\alpha + t)n) \le e^{-tpn/9}$.

• If $t \ge \alpha$, then $\Pr(X > (\alpha + t)n) \le e^{-tpn/5}$.

• If $t \ge 2\alpha$, then $\Pr(X > (\alpha + t)n) \le e^{-tpn/3}$.

• If $t \ge 3\alpha$, then $\Pr(X > (\alpha + t)n) \le e^{-tpn/2}$.

Proof: We follow the approach of Mulmuley [36], who uses the Chernoff technique (e.g., see [16, 34, 35, 36])
to prove a similar result for the special case when p = 1/2 and t > 6 (albeit with a slight flaw, which
we fix). For $0 < \lambda < \ln(1/(1 - p))$,

$$E(e^{\lambda X_i}) = \sum_{j=1}^{\infty} e^{\lambda j} p q^{j-1} = p e^{\lambda} \sum_{j=0}^{\infty} (e^{\lambda} q)^j = \frac{p e^{\lambda}}{1 - e^{\lambda} q}.$$

Applying the Chernoff technique, then,

$$\Pr(X > (\alpha + t)n) \le e^{-\lambda(\alpha+t)n} \left(\frac{p e^{\lambda}}{1 - e^{\lambda} q}\right)^{n} = \left(\frac{p\, e^{-\lambda(\alpha+t-1)}}{1 - e^{\lambda} q}\right)^{n}.$$

Let $\beta = p/(1 - p)$ and observe that we can satisfy the condition that $0 < \lambda < \ln(1/(1 - p))$ by setting

$$e^{\lambda} = 1 + \frac{\beta t}{\alpha + t}.$$

By substitution and some calculation, note that

$$e^{-\lambda} = 1 - \frac{\beta t}{\alpha + t + \beta t}$$

and

$$\frac{p}{1 - e^{\lambda} q} = 1 + tp.$$

Thus, we can bound $\Pr(X > (\alpha + t)n)$ by

$$\left[\left(1 - \frac{\beta t}{\alpha + t + \beta t}\right)^{\alpha+t-1} (1 + tp)\right]^{n}.$$

Moreover, since $1 - x \le e^{-x}$, for all x,

$$\left(1 - \frac{\beta t}{\alpha + t + \beta t}\right)^{\alpha+t-1} \le e^{-\beta t(\alpha+t-1)/(\alpha+t+\beta t)} = e^{-tp}.$$

Therefore,

$$\Pr(X > (\alpha + t)n) \le e^{-tpn} (1 + tp)^{n}.$$

Unfortunately, if we use the well-known inequality, $1 + x \le e^{x}$, with x = tp, to bound the "1 + tp" term in
the above equation, we get a useless result. So, instead, we use better approximations:

• If $0 < x < 1$, then we can use a truncated Maclaurin series to bound

$$\ln(1 + x) \le x - \frac{x^2}{2} + \frac{x^3}{3}.$$

Thus, for $0 < x \le 1/2$,

$$1 + x \le e^{x - x^2/3},$$

which implies that, for $0 < t < \alpha/2$,

$$\Pr(X > (\alpha + t)n) \le \left(e^{-(tp)^2/3}\right)^{n}.$$

• The remaining bounds follow from the following facts, which are easily verified:

1. If $x \ge 1/2$, then $1 + x \le e^{x/(1+1/8)}$.

2. If $x \ge 1$, then $1 + x \le e^{x/(1+1/4)}$.

3. If $x \ge 2$, then $1 + x \le e^{x/(1+1/2)}$.

4. If $x \ge 3$, then $1 + x \le e^{x/2}$.

■
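
The elementary inequalities used at the end of this proof can be spot-checked numerically. The following sketch evaluates them on a finite grid, which is an illustration and not a proof; the constants are written as $1/(1+1/8)$, $1/(1+1/4)$, $1/(1+1/2)$, and $1/2$, and the grid range and step count are arbitrary choices.

```python
import math

# (lower limit for x, constant c) pairs: the claim is 1 + x <= exp(c * x) for x >= limit.
FACTS = [(0.5, 1 / (1 + 1 / 8)), (1.0, 1 / (1 + 1 / 4)), (2.0, 1 / (1 + 1 / 2)), (3.0, 1 / 2)]

def fact_holds(limit: float, c: float, steps: int = 2000, span: float = 50.0) -> bool:
    """Check 1 + x <= exp(c * x) on an evenly spaced grid over [limit, limit + span]."""
    return all(
        math.log1p(limit + span * i / steps) <= c * (limit + span * i / steps)
        for i in range(steps + 1)
    )

def small_x_holds(steps: int = 2000) -> bool:
    """Check ln(1 + x) <= x - x^2/3 on a grid over (0, 1/2]."""
    return all(
        math.log1p(x) <= x - x * x / 3
        for x in (0.5 * i / steps for i in range(1, steps + 1))
    )

all_facts_ok = all(fact_holds(limit, c) for limit, c in FACTS)
small_ok = small_x_holds()
```

Each fact turns the factor $(1 + tp)^n$ into $e^{c\,tpn}$ with $c < 1$, leaving $e^{-(1-c)tpn}$ after multiplying by $e^{-tpn}$; with $c = 8/9, 4/5, 2/3, 1/2$ this gives denominators 9, 5, 3, and 2.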

Incidentally, the bound for $t \ge 3\alpha$ fixes a slight flaw in a Chernoff bound proof by Mulmuley [36].

B A Slightly Super-Linear Algorithm for Data-Oblivious Loose Compaction 

In this appendix, we provide a proof for Theorem 9, which states that, given an array, A, of size N, such that
at most R < N/4 of A's elements are marked as distinguished, we can compress the distinguished elements
in A into an array B of size 4.25R using a data-oblivious algorithm that uses $O((N/B) \log^*(N/B))$ I/Os
and succeeds with probability at least $1 - 1/(N/B)^d$, for any given constant $d \ge 1$, assuming only that
$B \ge 1$ and $M \ge 2B$. This method does not necessarily preserve the order of the items in A.

Proof: We begin by performing the consolidation operation of Lemma 3, so that each block in A, save one,
is completely full or completely empty. We then perform the following RAM algorithm at the granularity
of blocks, so that each memory read or write is actually an I/O for an entire block. That is, we describe a
RAM algorithm that runs in $O(n \log^* n)$ time, with high probability, where n = N/B and r = R/B, and
we implement it for the external-memory model so as to perform each read or write of a word as a read or
write of a block of size B.

So, suppose we are given an array A of size n, such that at most r < n/4 of the elements in A are
marked as "distinguished" and we want to map the distinguished elements from A into an array D, of size
4.25r using an algorithm that is data-oblivious.

Our algorithm begins by testing if n is smaller than a constant, $n_0$, determined in the analysis, in which
case we compact A using the deterministic data-oblivious sorting algorithm of Lemma 2. In addition, we
also test if $r < n/\log^2 n$, in which case we use Theorem 4 to compact A into D. (Note that in either of
these base cases, we can compact the at most r distinguished elements of A into an array of size exactly r.)
Thus, let us assume that $r \ge n/\log^2 n$ and $n \ge n_0$.

Our algorithm for this general case is loosely based on the parallel linear approximate compaction
algorithm of Matias and Vishkin [32] and proceeds with a number of phases. In this general case, we restrict
D to its first 4r cells, leaving its last 0.25r cells for later use. At the beginning of each phase, i, we assume
inductively that there are at most

$$\frac{r}{t_i^4}$$

distinguished elements left in A, where $t_1 = 2^2$ and

$$t_{i+1} = 2^{t_i},$$

for $i \ge 1$. That is, the $t_i$'s form the tower-of-twos sequence; hence, there are at most $O(\log^* n)$ phases.
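
The growth of the tower-of-twos sequence is easy to see by computing its first few terms. A small sketch follows; the function name `tower_of_twos` and the stopping rule based on a `limit` argument are assumptions made for illustration.

```python
def tower_of_twos(limit: int) -> list[int]:
    """Return t_1, t_2, ... with t_1 = 2**2 and t_{i+1} = 2**t_i, stopping once t_i >= limit."""
    ts = [4]
    while ts[-1] < limit:
        ts.append(2 ** ts[-1])
    return ts

phases = tower_of_twos(10**9)
```

Already the fourth term has 65537 bits, so for any realistic n only a handful of phases are ever executed, which is the practical meaning of the log-star factor.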

We begin our data-oblivious algorithm for compacting the distinguished elements in A into D by
performing $c_0$ A-to-D thinning passes, for a constant, $c_0$, determined by the following lemma.

Lemma 24: Performing an initial $c_0$ A-to-D thinning passes in the general-case algorithm results in there
being at most $r/t_1^4$ distinguished elements in A, with probability at least $1 - n^{-d}$, for any given constant
$d \ge 1$, provided $c_0 \ge 2^3$.

Proof: If we perform $c_0 \ge 2^3$ initial A-to-D thinning passes, then the expected number of distinguished
elements, $\mu$, still in A will be bounded by

$$\frac{r}{2^{16}},$$

since the probability of a collision in each pass is at most 1/4. Moreover, we can bound the number of
failing distinguished elements by a sum, X, of r independent 0-1 random variables, each of which is 1 with
probability $1/2^{16}$. Since $t_1 = 2^2 = 4$,

$$\Pr\left(X > \frac{r}{t_1^4}\right) = \Pr\left(X > \frac{r}{4^4}\right) = \Pr(X > 2^8 \mu),$$

which we can bound using the Chernoff bound of Lemma 22 as

$$\Pr(X > 2^8 \mu) < 2^{-256\mu}.$$

In addition, note that since $r > n/\log^2 n$,

$$256\mu = \frac{256r}{2^{16}} > \frac{256n}{2^{16} \log^2 n} \ge d \log n,$$

provided we set $n_0$ so that the last inequality is true whenever $n \ge n_0$. Thus, the probability of failure is at
most $n^{-d}$. ■ (for Lemma 24)

So, let us assume inductively that at the beginning of phase i there are at most $r/t_i^4$ distinguished
elements remaining in A. If, at the beginning of phase i,

$$\frac{r}{t_i^4} < \frac{n}{\log^2 n},$$

then we apply Theorem 4 to compress the remaining distinguished elements of A into the last $0.25r \ge r/t_i^4$
cells of D, which completes the algorithm. That is, this results in a compaction of A into an array D of size
4.25r.

If, on the other hand, $r/t_i^4 \ge n/\log^2 n$, then we continue with phase i, which we divide into two steps,
a thinning-out step and a region-compaction step.



The Thinning-Out Step. In the thinning-out step we create an auxiliary array, C, of size r/ti, and we 
perform two yl-to-C thinning passes, which will fail to map any remaining distinguished item in A into C 
with probability at most = l/tf,by our induction hypothesis. Given this mapping, we then perform 

ti C-to-D thinning passes, so that the probability that any distinguished item in C fails to get mapped to D 
is at most 1/2^**. Since there may be some items originally in A that are now stuck in C, we grow A by 
concatenating the current copy of A with C. This grows the size of A to be n + J2]=i '"/^j' which is less 
than n + r/2 even accounting for all the times we grew A in previous phases. In addition, note that the total 
time to perform all these passes is 0(n + r). 

At this point in phase i, for each distinguished element that was in A at the beginning of the phase, we 
can bound the event that this element failed to get mapped into D with an independent indicator random 
variable that is 1 with probability l/tf, since tf < 2^*\ 

The Region-Compaction Step. In this step of phase i, we divide A into contiguous regions, $A_1$, $A_2$, and
so on, each spanning 2^** cells of A. Let us say that such a region, Aj, is over-crowded if it contains more 
than Vi = 2^*»/t? distinguished items. We note the following. 

Lemma 25: The probability that any given region, Aj, is over-crowded is less than 2~^'' . 



Proof: Given the way that the thinning pass is performed, we can bound the probability of failure of items 
being copied by independent indicator random variables that are 1 with probability l/tf. We can bound 
expected value, /i, of the sum, X, of these random variables with 



IF 



We can use the Chernoff bound of Lemma 22 and the fact that tf > 2*^, then, to bound 



Prix > 



24tr 



Pr (X > tf// 



< 2-*> 



< 



-4H 



since 22*» > tf, for U > 2^. 



■ (for Lemma 25) 
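
As a concrete illustration of the region partition used in this step, the following Python sketch splits an array into fixed-size contiguous regions and flags those holding more distinguished items than a given threshold. The parameters `region_size` and `threshold` are placeholders for the region length and the over-crowding bound defined above, and representing non-distinguished cells as `None` is an assumption made only for this example.

```python
from typing import List, Optional


def overcrowded_regions(A: List[Optional[object]],
                        region_size: int,
                        threshold: int) -> List[bool]:
    """Return one flag per contiguous region: True if it is over-crowded."""
    flags = []
    for start in range(0, len(A), region_size):
        count = 0
        # Every cell of the region is scanned, whatever it contains.
        for cell in A[start:start + region_size]:
            count += 1 if cell is not None else 0
        flags.append(count > threshold)
    return flags
```

The scan visits every cell in a fixed order, so the memory accesses reveal nothing about where the distinguished items sit; only the resulting flags depend on the data.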



So, for each region, Aj, we apply Theorem 4 to compact the distinguished elements in Aj in an oblivious 
fashion, assuming Aj is not over-crowded. Note that this method runs in $O(|A_j| + r_j \log^2 r_j)$ time, 
which is $O(|A_j|)$. Moreover, assuming that Aj is not over-crowded, this procedure fails for any such Aj 
(independently) with probability at most 

1 



which is at most 



1 



since tf < 2^*% for tj > 2^. 

Next, for each subarray, A'j, consisting of the first r cells in a region Aj, we perform tf A'j-to-D thinning 
passes. Thus, we can model the event that a distinguished member of such an A'j fails to be inserted into D 
with an independent indicator random variable that is 1 with probability at most 



In addition, we have the following. 

Lemma 26: The total number of distinguished elements that belong to a subarray A'j yet fail to be mapped 
to D is more than $r/(2 \cdot 2^{4t_i})$ with probability at most $1/(2n^d)$, for any given constant d > 1. 



Proof: There are at most r distinguished elements remaining in A at the beginning of phase i that could 
possibly belong to a subarray A'j. The probability that these fail to be mapped to D can be characterized 
with a sum, X, of independent indicator random variables, each of which is 1 with probability at most 2 
Thus, the expected value of X can be bounded by 



r 



and we can apply the Chernoff bound of Lemma 22 as follows: 



,2*2 



Fr(x> ^ ,A < Pr f X > ^ u 
V 2 -24*.; - \^ 2-24*.^^ 

Moreover, since $r/t_i^4 > n/\log^2 n$, we have that $t_i < 0.25 \log^{1/2} n$. Thus, 

$$\frac{r}{2 \cdot 2^{4t_i}} \ge \frac{n}{\log^2 n \cdot 2 \cdot 2^{\sqrt{\log n}}} \ge d \log n + 1,$$

for n larger than a sufficiently large constant, $n_0$. ■ (for Lemma 26) 
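
To get a feel for how large the constant $n_0$ needs to be, the following Python sketch evaluates the inequality $n/(\log^2 n \cdot 2 \cdot 2^{\sqrt{\log n}}) \ge d \log n + 1$ numerically. It assumes base-2 logarithms and restricts n to powers of two, $n = 2^{\ell}$; both are simplifications made for this check, and the function names are hypothetical.

```python
import math


def bound_holds(log_n: int, d: int) -> bool:
    """Check n / (log^2 n * 2 * 2^sqrt(log n)) >= d*log n + 1 for n = 2**log_n."""
    # The left-hand side is rewritten as 2^(log n - sqrt(log n)) / (2 * log^2 n).
    lhs = 2.0 ** (log_n - math.sqrt(log_n)) / (2 * log_n ** 2)
    return lhs >= d * log_n + 1


def smallest_log_n(d: int, limit: int = 1000) -> int:
    """Smallest exponent from which the bound holds for all tested values up to limit."""
    last_failure = 0
    for log_n in range(1, limit + 1):
        if not bound_holds(log_n, d):
            last_failure = log_n
    return last_failure + 1
```

The left-hand side grows roughly like $2^{\log n - \sqrt{\log n}}$ while the right-hand side is linear in $\log n$, so under these assumptions the bound takes hold at fairly modest values of n.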

Of course, a distinguished member of A may fail to be mapped into an A'j subarray, either because it 
belongs to an over-crowded region or it belongs to a region that fails to be compacted. Fortunately, we have 
the following. 

Lemma 27: The number of distinguished elements in A that fail to be mapped into D in phase i, either 
because they belong to an over-crowded region or because their region fails to be compacted, is more than 
$r/(2 \cdot 2^{4t_i})$ with probability at most $1/(2n^d)$. 

Proof: Recall that the probability that a region Aj is over-crowded is at most 2""^** and the probability 
that a non-over-crowded region Aj fails to be mapped to a compressed region, A'^, is at most 2~^*'. Let us 
consider the latter case first. 

Let X denote the number of distinguished elements that belong to non-over-crowded regions that fail 
to be compacted in phase i. Note that the number of non-empty regions is at most $r/t_i^4$ and the number 
of distinguished items per such region is clearly no more than $2^{4t_i}$. Thus, if we let Y be the number of 
non-empty non-over-crowded regions that fail to be compressed, then 

$$X \le 2^{4t_i} Y.$$

Moreover, since Y is the sum of at most $r/t_i^4$ independent indicator random variables, and 

EiY)<fi = r/{tf2^'^), 
we can use the Chernoff bound from Lemma 22 as follows: 

Pi(x > . ^ ,A < Pr f y > 



< Pr (y > 2*'-2^4^j 

^ 2-V(log^"-2^v^^°^) 
since $t_i < 0.25 \log^{1/2} n$ and tf > 2*. Note that, for n greater than a sufficiently large constant, $n_0$, 

$$\frac{n}{\log^2 n \cdot 2^{2\sqrt{\log n}}} \ge d \log n + 2,$$

for any fixed constant d > 1. Thus, the number of distinguished elements that belong to non-over-crowded 
regions that fail to be compacted in phase i is more than $r/(4 \cdot 2^{4t_i})$ with probability at most $1/(4n^d)$. By 
a similar argument, the number of distinguished items that belong to over-crowded regions is more than 
$r/(4 \cdot 2^{4t_i})$ with probability at most $1/(4n^d)$. In fact, this case is easier, since ti > 4 implies 2~^ ' < 2^*\. 
Therefore, the number of distinguished elements that fail for either of these reasons is more than $r/(2 \cdot 2^{4t_i})$ 
with probability at most $1/(2n^d)$. ■ (for Lemma 27) 



Thus, combining this lemma with Lemma 26, we have that the number of distinguished elements in A 
that are not mapped (data-obliviously) into D in phase i is at most 

$$\frac{r}{2^{4t_i}}$$

with probability at least $1 - 1/n^d$, which gives us the induction invariant for the next phase. This completes 
the proof (noting that we can apply the above algorithm with d' = d + 1, since there are O(log* n) phases, 
each of which runs in O(n) time and fails with probability at most $1/n^{d'}$). ■ (for the theorem) 
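
To see how slowly the number of phases grows, the following Python sketch counts the phases that run before the final compaction. It rests on several assumptions that are not spelled out in this part of the proof: the sequence starts at $t_1 = 4$ and grows as $t_{i+1} = 2^{t_i}$, phase i runs exactly when $r/t_i^4 > n/\log^2 n$, logarithms are base 2, and n is a power of two. The function name is hypothetical.

```python
def phases_before_final_compaction(log_n: int, r: int) -> int:
    """Count phases i with r / t_i**4 > n / log(n)**2, where n = 2**log_n."""
    n = 1 << log_n
    t = 4          # assumed starting value t_1
    phases = 0
    # r / t**4 > n / log_n**2 is tested as r * log_n**2 > n * t**4,
    # which keeps the comparison in exact integer arithmetic.
    while r * log_n ** 2 > n * t ** 4:
        phases += 1
        t = 2 ** t  # assumed tower-of-twos growth
    return phases
```

Under these assumptions the loop can only continue while $t_i^4 < (r/n) \log^2 n$, so the tower-of-twos growth of $t_i$ ends it after very few iterations even for astronomically large n, in line with the O(log* n) phase bound.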